%%html
<style>
/* Jupyter */
.rendered_html table,
/* Jupyter Lab*/
div[data-mime-type="text-markdown"] table {
margin-left: 0
}
</style>
# Basic Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
## Data Visualization Libraries
import seaborn as sns # data plots and visualization
sns.set(style="white")
sns.set(color_codes=True)
import matplotlib.pyplot as plt # data plots
%matplotlib inline
from IPython.display import HTML
vehicle = pd.read_csv('vehicle.csv')
vehicle.head(10)
vehicle.shape
vehicle.dtypes
vehicle.columns
The vehicle data contains 846 observations with 19 variables: 18 numerical attributes and one categorical variable defining the class of each object.
The following table gives a brief description of all features.
| Attribute | Data Type | Description |
|---|---|---|
| compactness | Numeric | Compactness (Average perimeter ** 2/Area) |
| circularity | Numeric | Circularity (Average radius ** 2/Area) |
| distance_circularity | Numeric | Distance Circularity (Area/Average distance from border ** 2) |
| radius_ratio | Numeric | Radius Ratio (Maximum radius - Minimum radius/Average radius) |
| pr.axis_aspect_ratio | Numeric | Principal Axis Aspect Ratio (Minor axis/Major axis) |
| max.length_aspect_ratio | Numeric | Maximum Length Aspect Ratio (Length perpendicular to maximum length/Maximum length) |
| scatter_ratio | Numeric | Scatter Ratio (Inertia about minor axis/Inertia about major axis) |
| elongatedness | Numeric | Elongatedness (Area/Shrink width ** 2) |
| pr.axis_rectangularity | Numeric | Principal Axis Rectangularity (Area/Principal axis length * Principal axis width) |
| max.length_rectangularity | Numeric | Max Length Rectangularity (Area/Maximum length * Length perpendicular to this) |
| scaled_variance | Numeric | Scaled Variance Along Major Axis (2nd order moment about minor axis/Area) |
| scaled_variance.1 | Numeric | Scaled Variance Along Minor Axis (2nd order moment about major axis/Area) |
| scaled_radius_of_gyration | Numeric | Scaled Radius of Gyration (Maximum variance + Minimum variance/Area) |
| scaled_radius_of_gyration.1 | Numeric | Scaled Radius of Gyration 1 (Maximum variance + Minimum variance/Area) |
| skewness_about | Numeric | Skewness |
| skewness_about.1 | Numeric | Skewness About Major Axis (3rd order moment about major axis/sigma_minor ** 3) |
| skewness_about.2 | Numeric | Skewness About Minor Axis (3rd order moment about minor axis/sigma_major ** 3) |
| hollows_ratio | Numeric | Hollows Ratio (Area of hollows/Area of bounding polygon), where sigma_major ** 2 is the variance along the major axis, sigma_minor ** 2 is the variance along the minor axis, and area of hollows = area of bounding polygon - area of object |
| class | Categorical | Vehicle class (3 values): bus, car, van |
vehicle.info()
Since the number of non-null observations is lower than the number of entries, missing values are likely present.
# Create a helper that counts missing values in a Series:
def num_missing(x):
    return x.isnull().sum()
#Applying per column:
print("Missing Values in Attributes (if any):")
print(vehicle.apply(num_missing, axis=0))
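The same per-column count can also be obtained directly with pandas' built-in aggregation; a minimal sketch on a small synthetic frame (the column names here are illustrative, not from the vehicle data):

```python
import numpy as np
import pandas as pd

# Small synthetic frame with a few deliberate NaNs
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# isnull().sum() counts missing values column-wise in one call
missing = df.isnull().sum()
print(missing)
```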
# Missing Data correction by dropping null values
vehicle.dropna(inplace=True)
vehicle.shape
Out of 846 records, 33 containing NaN values were dropped, leaving 813 records. Note: dropping is advisable only when missing values are few (say 0.01–0.5% of the data); the percentage is just a rule of thumb.
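When too many rows would be lost, imputation is a common alternative to dropping. A hedged sketch of median imputation on synthetic data (not the cleaning step actually used in this notebook):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; numbers are made up for illustration
df = pd.DataFrame({"radius_ratio": [150.0, np.nan, 170.0, 160.0]})

# Fill missing values with the column median instead of dropping the row
median = df["radius_ratio"].median()
df["radius_ratio"] = df["radius_ratio"].fillna(median)
print(df["radius_ratio"].tolist())
```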
## Cross-validate that no missing values remain
# Create a helper that counts missing values in a Series:
def num_missing(x):
    return x.isnull().sum()
#Applying per column:
print("Missing Values in Attributes (if any):")
print(vehicle.apply(num_missing, axis=0))
# Check if any rows contain the placeholder "?" in any attribute
vehicle[(vehicle == "?").any(axis=1)]
No question-mark ("?") placeholder is present in any data attribute.
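Had any "?" placeholders been found, they could be converted to NaN so the earlier missing-value handling applies. A sketch on synthetic data (column name reused for illustration only):

```python
import numpy as np
import pandas as pd

# Toy column where "?" stands in for a missing measurement
df = pd.DataFrame({"circularity": ["45", "?", "40"]})

# Replace the placeholder with NaN, then coerce the column to numeric
df["circularity"] = pd.to_numeric(df["circularity"].replace("?", np.nan))
print(df["circularity"].isnull().sum())
```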
## Check for duplicate data
dups = vehicle.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
vehicle[dups]
No duplicate rows exist, so no duplicate-removal step is required.
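Had duplicates been present, `drop_duplicates` would remove them; a minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame where row 1 duplicates row 0
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

# duplicated() flags repeated rows; drop_duplicates() removes them
n_dups = df.duplicated().sum()
deduped = df.drop_duplicates()
print(n_dups, len(deduped))
```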
vehicle.describe(include = ["object"]).transpose()
df = vehicle.describe().transpose()
dfStyler = df.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
for feature in vehicle.columns:  # Loop through all columns in the dataframe
    if vehicle[feature].dtype == 'object':  # Only apply to columns holding categorical strings
        vehicle[feature] = pd.Categorical(vehicle[feature])  # Convert strings to the pandas Categorical dtype
replaceStruct = {"class": {"bus": 1,"car": 2, "van": 3}}
# Replace class labels with their integer codes
vehicle_new = vehicle.replace(replaceStruct)
# Illustrate Top 10 rows of modified data
vehicle_new.head(10)
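The label-to-integer mapping above can be verified on a toy frame using the same `replace` scheme:

```python
import pandas as pd

# Toy class column with the same labels as the vehicle data
df = pd.DataFrame({"class": ["bus", "car", "van", "car"]})
replaceStruct = {"class": {"bus": 1, "car": 2, "van": 3}}

# replace() substitutes each class label with its integer code
df_new = df.replace(replaceStruct)
print(df_new["class"].tolist())
```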
PairPlot: to plot multiple pairwise bivariate distributions in a dataset, the pairplot() function is used. It creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes.
sns.pairplot(vehicle_new, hue="class", diag_kind = 'kde', palette="husl", markers=["s", "*", "+"])
plt.show()
# Class Legend: s (value = 1) - bus, * (value = 2) - car, + (value = 3) - van
# Correlation Matrix Table
corr = vehicle_new.corr()
cmap = sns.diverging_palette(255, 5, as_cmap=True)
def magnify():
    return [dict(selector="th",
                 props=[("font-size", "9pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '100px'),
                        ('font-size', '12pt')])
            ]
corr.style.background_gradient(cmap, axis=1)\
.set_properties(**{'max-width': '100px', 'font-size': '9pt'})\
.set_caption("Hover to Magnify")\
.set_precision(2)\
.set_table_styles(magnify())
# Representing correlation through a heatmap
fig, ax = plt.subplots(figsize=(20,12))
sns.heatmap(corr, center=0, cmap=cmap, annot=True, annot_kws={"size": 13})
plt.show()
Correlation Inferences
Several variables are highly correlated (co-linear), which can cause model overfitting and run into the multicollinearity conundrum. In plain English, two highly correlated variables impart nearly the same information to the model; including both adds noise rather than incremental information and weakens the model. Pairs of highly correlated variables can therefore be pruned to reduce dimensionality without much loss of information.
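The pruning described above can be automated: take the absolute correlation matrix, keep its upper triangle, and flag any column whose correlation with an earlier column exceeds a threshold. A minimal sketch on synthetic data (the 0.9 cut-off and the column names are illustrative, not taken from the vehicle data):

```python
import numpy as np
import pandas as pd

# Synthetic frame: x_dup is nearly collinear with x, z is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_dup": x + rng.normal(scale=0.01, size=200),
    "z": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle to skip self-correlations and mirrored pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag columns correlated above the threshold with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```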
Dropping correlated variables from vehicle data:
vehicle_fn = vehicle_new.drop(['max.length_rectangularity', 'scaled_radius_of_gyration', 'scatter_ratio',
'elongatedness','pr.axis_rectangularity','scaled_variance.1', 'hollows_ratio'],
axis = 1)
vehicle_fn.shape
corr1 = vehicle_fn.corr()
cmap = sns.diverging_palette(255, 5, as_cmap=True)
fig, ax = plt.subplots(figsize=(14,8))
sns.heatmap(corr1, center=0, cmap=cmap, annot=True, annot_kws={"size": 13})
plt.show()
Depending on the modelling requirements, a threshold cut-off for correlated variables can be chosen to remove multicollinearity from the data.